Abstract. We examine the evolution of expression patterns and the organization 
of genetic information in populations of self-replicating digital organisms. Seeding 
the experiments with a linearly expressed ancestor, we witness the development of 
complex, parallel secondary expression patterns. Using principles from information 
theory, we demonstrate an evolutionary pressure towards overlapping expressions 
causing variation (and hence further evolution) to sharply drop. Finally, we compare 
the overlapping sections of dominant genomes to those portions which are singly 
expressed and observe a significant difference in the entropy of their encoding. 



1 Introduction 

Life on Earth is the product of approximately four billion years of evolution, 
with the vast majority of beginning and intermediate states lost to us forever. 
The exact details of how we evolved to become what we are may be impos- 
sible to ascertain for sure, but we may still be able to better understand the 
evolutionary pressures exerted on life, and from that reconstruct sections of 
the path our evolution is likely to have taken. 

Here we look at an issue fundamental to life as we know it: the organization 
of the genetic code and the differentiation in its expression. DNA is 
structured into many distinct genes which can be concurrently active, transcribed 
and expressed in an asynchronous (i.e., differentiated) manner. Extant living 
systems have evolved to a state where multiple genes influence each other, 
typically without sharing genetic material. It appears that in all higher life 
forms each gene has its own unique position on the genome, while the 
transcription products often interact with unique positions "downstream". Those 
organisms which do exhibit overlapping expression patterns are mostly viruses 
and bacteriophages [?]. This suggests that genomes containing only purely 
localized, non-overlapping genes must have evolved later on [?]. 

Upon initial inspection, the reason for a spatially separated layout appears 
uncertain. A modular design may be quite common in artificially created cod- 
ing schemes such as computer programs, but, in fact, only reflects a designer's 
quest to create human-understandable structures. Evolution has no such 
incentive, and will always exert pressure towards the most immediate solution 
given the current circumstances. A more compressed coding scheme, perhaps 
with overlapping genes, would allow a sufficiently shorter code that would 
minimize the mutational load and hence be able to preserve its information 
with a higher degree of accuracy. Furthermore, such overlapping regions might 
be used for gene regulation. Why this is not much more common becomes 
clearer when we observe those examples from nature where these overlapping 
reading frames do exist, such as DNA phages [?] and eukaryotic viruses [?]. 
Even in these organisms only some sections of code overlap, but examination 
of those sections reveals that they contain little variation — almost all of the 
nucleotides are effectively frozen in their current state from one generation to 
the next. This occurs because for any mutation to be neutral in such a 
section of genetic code, it must be neutral to both of the genes which it would 
affect. Further, most of the mutations that occur in DNA which are neutral 
occur in the third nucleotide of a codon, as substitutions in that position are 
often synonymous. When overlapping genes have offset (out-of-phase) read- 
ing frames, however, the position of the third nucleotide in one gene maps to 
the first or second in the other, leaving no redundancy. 

We have investigated the development of genome organization and differ- 
entiation in digital organisms: populations of self-replicating computer-code 
living in a computer's memory. Such "Artificial Life" systems have proven to 
be useful test cases to investigate the biochemical paradigm because the 
computational chemistry the digital organisms are based on shares Turing 
universality with their biochemical cousins, i.e., just as any type of organism 
appears to be implementable in biochemistry, the digital organisms can in 
principle compute any (partially-recursive) function [?]. Due to the ease with 
which experiments can be prepared, data can be gathered, and trials can be 
repeated, digital organisms present an important tool to study universal traits 
in the evolution and development of symbolic sequences. Differentiation in 
digital organisms was first investigated within the tierra architecture [12, 15], 
and we comment on those results below. 

For the present study, we have extended our avida system [?] to allow for 
the expression of a second gene to occur in parallel. We then processed the 
evolution of 600 populations from a seed program to complex information- 
processing sequences for an average of over 9000 generations each. The 600 
trials were divided into four sets which differ in the length of the seed program, 
constraints on size evolution, and their ability to express multiple portions 
of code in parallel. All populations with a genetic basis allowing for the 
development of multiple threads learn to use them almost immediately (each 
thread is an instruction pointer which executes the code independently), but 
the methods by which this happens are quite distinct and varied. In the next 
section, we outline the most important design characteristics of the avida 
system, focusing mostly on the particular experimental setup needed for this 
study. Also, we outline the kind of observables which we record, and discuss 
measures of differentiation. In Section 3 we present results obtained with our 
multiple-expression digital chemistry and compare them to controls in which 






Evolution of Genetic Organization in Digital Organisms 3 



no secondary expression was allowed. In Section 4 we study the evolution of 
differentiation for different experimental boundary conditions, while Section 
5 explores in more detail the organization and development of genes by means 
of an example. We close in Section 6 with a discussion of the evidence 
and conclusions, and issue caveats about applying the lessons learned directly 
to biochemistry. 

2 Experimental Details 
2.1 The Avida Platform 

The computer program avida is an auto-adaptive genetic system [?] designed 
primarily for use as a platform in Artificial Life research. The system con- 
sists of a population of self-reproducing strings of instructions with a Turing- 
complete genetic basis subjected to Poisson-random mutations during repro- 
duction. The population adapts to a combination of an intrinsic fitness cri- 
terion (self-reproduction) and an externally imposed (extrinsic) fitness land- 
scape provided by the researcher by creating an information-rich environ- 
ment. 

A normal avida organism is a computer program written in a very simple 
assembly language, with 28 possible commands for each line (Table 1). 



Table 1. Standard (single expression) avida instruction set 

Instruction type   Mnemonic 
flow control       jump-b, jump-f, call, return 
conditionals       if-n-eq, if-less, if-bit-1 
self analysis      search-f, search-b 
computation        shift-l, shift-r, inc, dec, swap, 
                   swap-stk, push, pop, add, sub, nand 
metabolic          alloc, divide, copy 
I/O                get, put 
labels             nop-A, nop-B, nop-C 



These programs exist on a two-dimensional lattice with toroidal bound- 
ary conditions, and are executed on simple virtual CPUs residing at the 
lattice-sites which process their code allowing them to interact with their 
environment and perform functions such as self-replication, as well as 
computations on numbers which are found in the external environment. For more 
details on the virtual CPUs in avida, see [10]. 

In order to study the evolution of code expression, we have extended the 
instruction set of Table 1 to allow for more than one instruction pointer to 
execute a program's code. Within the biochemical metaphor, the simultaneous 
execution of code is viewed as the concurrent expression of two genes, i.e., the 
chemical action of two proteins. The first new instruction allows a program 
to initiate a new expression: fork-th. Its execution creates a new instruc- 
tion pointer ("forking off a thread") which immediately executes the next 
instruction, while the original thread skips it. Thus, fork-th is the rough 
equivalent of a promoter sequence in biochemistry. In a sense, this secondary 
expression is rather trivial and leads to redundancy; if the second thread is 
not sufficiently altered by the instruction following the fork-th, it simply 
executes the identical code as the first thread in lock-step. Of course, we are 
interested in how the organisms use this redundancy as a starting point to 
diversify the expression. 

The second new instruction inhibits an expression: kill-th removes the 
instruction pointer which executed it, while the third addition id-th iden- 
tifies which pointer is currently executing the code, i.e., which pattern is 
currently being expressed. We expect the three commands together to be 
useful in the regulation of expression. In principle, more than two instruction 
pointers can be generated by repeated issuing of the fork-th command, but 
here we restrict ourselves to a maximum of two threads in order not to com- 
plicate the analysis. In nature, of course, complex genomes express hundreds 
of proteins simultaneously. 
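
The thread commands described above can be illustrated with a toy interpreter. This is a minimal sketch and not avida's implementation: the names `Thread`, `run_plain`, `step`, and `simulate` are hypothetical, every instruction other than the three thread commands is treated as a no-op, and three behaviours that the text does not specify are assumed here (a fork-th issued while two pointers already exist is ignored, a kill-th issued by the last remaining pointer is ignored, and a thread command that happens to be the child's first instruction is executed as a plain instruction).

```python
from dataclasses import dataclass, field

MAX_THREADS = 2  # the study restricts organisms to two instruction pointers


@dataclass
class Thread:
    tid: int
    ip: int
    seen_id: int = -1                      # last value reported by id-th
    trace: list = field(default_factory=list)


def run_plain(th, genome):
    """Execute one instruction that has no thread-creating or -removing effect."""
    pos = th.ip % len(genome)
    th.trace.append(pos)
    if genome[pos] == "id-th":
        th.seen_id = th.tid                # the thread learns which pointer it is
    th.ip = (pos + 1) % len(genome)


def step(genome, threads):
    """Advance every live thread by one instruction and return the live threads."""
    ids = {t.tid for t in threads}
    alive = []
    for th in threads:
        pos = th.ip % len(genome)
        op = genome[pos]
        if op == "fork-th" and len(ids) < MAX_THREADS:
            th.trace.append(pos)
            new_id = min(set(range(MAX_THREADS)) - ids)
            ids.add(new_id)
            child = Thread(tid=new_id, ip=(pos + 1) % len(genome))
            run_plain(child, genome)          # child runs the next instruction now
            th.ip = (pos + 2) % len(genome)   # parent skips that instruction
            alive += [th, child]
        elif op == "kill-th" and len(ids) > 1:
            th.trace.append(pos)
            ids.discard(th.tid)               # the executing pointer is removed
        else:
            run_plain(th, genome)
            alive.append(th)
    return alive


def simulate(genome, n_steps):
    """Run a circular genome from position 0; return (live threads, all threads created)."""
    live = [Thread(tid=0, ip=0)]
    created = list(live)
    for _ in range(n_steps):
        live = step(genome, live)
        created += [t for t in live if all(t is not c for c in created)]
    return live, created


genome = ["fork-th", "id-th", "inc", "inc", "kill-th"]
live, created = simulate(genome, 4)
for t in created:
    print(t.tid, t.trace, t.seen_id)
```

In this run the child executes the id-th that follows the fork while the parent skips it, after which both pointers advance in lock-step until the parent reaches kill-th. This is the "trivial" secondary expression from which differentiation has to evolve.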

As our experiments begin with a self-replicating program which does not 
use any of the multiple expression commands, the first question might be 
whether or not multiple expression will develop at all. In fact, it does almost 
instantly, as secondary expression (typically in the trivial mode mentioned 
earlier) appears to be immediately beneficial, perhaps in the same manner as 
simple gene doubling or a second promoter sequence. From here on, differenti- 
ation evolves, i.e., the two instruction pointers begin to adapt independently, 
to express more and more different code. Ultimately one might expect that 
each pointer executes an entirely different section of code, achieving local 
separation of genes and fully parallelized execution. The mode and manner 
in which this separation occurs is the subject of this investigation. 

Several hundred independent experimental trials and controls were ob- 
tained in this study, testing different experimental conditions. For each of 
these trials we keep a record of a variety of statistics, including the dominant 
genotype at each time step, from which we can track the progression of evolu- 
tion of the population, in particular by studying the details of its expression 
patterns. 

2.2 Basic Analysis Metrics 



In order to track the differentiation of the threads, we need to develop a 
means to monitor the divergence between the two instruction pointers roaming 
the genome. Also, to study the evolutionary pressures such as the mutational 
load, we need to introduce some standard (and some less standard) 
observables which allow us to track the adaptability of the population. This is 
one of the major advantages of digital chemistries — some of the data that we 
collect is impossible to accurately obtain in biochemical systems, and even 
less practical to analyze. 

Fitness is measured as the number of offspring a genome produces per 
unit time, normalized to the replication rate of the ancestor. Thus, in all 
experiments the fitness of the dominant genotype starts at one and increases. 
Fitness improvements are due to two effects: the optimization of the gene for 
replication (the "copy-loop") leading to a smaller gestation time, as well as 
the development of new genes which accomplish computations on externally 
provided random numbers. These computations are viewed as the equivalent 
of exothermic catalytic reactions mediated by the expression products. We 
reward the accomplishment of all bit-wise logical operations performed on 
up to three numbers by speeding-up the successful organism's CPU at a rate 
commensurate to the difficulty of the computation. 

Fidelity is the probability for an organism to produce an offspring per- 
fectly identical to itself, i.e., the probability that the offspring is unaffected by 
mutations during the copy process. For pure copy-mutations (each instruction 
copied is mutated with a probability $R_c$), 

$$F = (1 - R_c)^{\ell} \qquad (1)$$

where $\ell$ is the organism's sequence length. In an adapting population, other 
factors can affect the fidelity and lead to low-fidelity organisms even while 
the theoretical fidelity is high. On the other hand, the development of error- 
correction schemes could increase the actual fidelity. 

Neutrality $\nu$ is the probability that an organism's fitness is unaffected 
by a single point mutation in its genome. This is calculated by obtaining all 
possible one-point mutations of the examined genome, and processing each 
of them in isolation to determine fitness. The neutrality is then the number 
of neutral mutations, $N_{\rm neut}$, divided by the total tested: 

$$\nu = \frac{N_{\rm neut}}{\ell\,(D-1)} \qquad (2)$$

where $\ell$ is the sequence length and $D$ is the number of different 
instructions in the digital chemistry, i.e., the size of the instruction set. 
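
The enumeration behind Eq. (2) can be sketched as follows. This is an illustrative sketch only: the function name `neutrality`, the tolerance `tol`, and the toy fitness function are assumptions introduced here, whereas in avida the fitness of each mutant is obtained by actually running it.

```python
def neutrality(genome, instruction_set, fitness, tol=1e-12):
    """Fraction of one-point mutants whose fitness equals that of `genome`.

    `fitness` is any callable mapping a genome (list of instructions) to a
    number; it stands in for running the mutant in isolation.
    """
    base = fitness(genome)
    neutral = 0
    tested = 0
    for i, original in enumerate(genome):
        for ins in instruction_set:
            if ins == original:
                continue                      # not a mutation
            mutant = genome[:i] + [ins] + genome[i + 1:]
            tested += 1                       # ends up as len(genome) * (D - 1)
            if abs(fitness(mutant) - base) <= tol:
                neutral += 1
    return neutral / tested


# Hypothetical toy fitness: only the first two sites matter.
def toy_fitness(g):
    return 1.0 if g[:2] == ["A", "B"] else 0.5


nu = neutrality(list("ABCA"), list("ABC"), toy_fitness)
print(nu)
```

With a genome of length 4 and D = 3 the loop tests 4 x 2 = 8 mutants, which matches the denominator of Eq. (2).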

The preceding three indicators are key in determining the ability of an 
organism to thrive in an avida environment. Fitness, fidelity, and neutrality 
correspond respectively to an organism's ability to create offspring, for those 
offspring to have a minimum mutational load, and for them to survive those 
mutations which they do bear. Apart from this, however, there is another 
aspect which is necessary for a phylogenetic branch to be successful, and 
that is its ability to further adapt to its environment. To characterize this, 
we define two more genomic attributes: 
Neutral Fidelity is a measure which can be calculated once an organism's 
neutrality is known. It is the probability that an organism will give 
birth to an identical or equivalent offspring. Taking $f_c = R_c (1 - \nu)$ to be the 
probability for a line to be mutated and be non-neutral to the organism, we 
obtain the neutral fidelity as: 

$$F_{\rm neut} = (1 - f_c)^{\ell} \qquad (3)$$

Genomic Diffusion Rate is the probability for an offspring to have a 
genome different from its parent, but to be otherwise equivalent (i.e., neutral). 
This is obtained by subtracting the genome's fidelity from its neutral fidelity: 

$$D_g = F_{\rm neut} - F \qquad (4)$$

This is a particularly important indicator as it is the rate at which new, 
viable genotypes are being created, which in turn is the pace at which genetic 
space is being explored, and therefore directly proportional to the rate of 
adaptation. 
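
As a worked illustration of Eqs. (1), (3), and (4), the three quantities can be computed directly from the copy-mutation rate, the sequence length, and the neutrality. The function names are hypothetical, and the numerical values in the usage lines are illustrative assumptions rather than parameters taken from the experiments.

```python
def fidelity(copy_rate, length):
    """Eq. (1): probability that all `length` instructions are copied unchanged."""
    return (1.0 - copy_rate) ** length


def neutral_fidelity(copy_rate, length, nu):
    """Eq. (3): probability that no copied instruction suffers a non-neutral mutation."""
    f_c = copy_rate * (1.0 - nu)
    return (1.0 - f_c) ** length


def diffusion_rate(copy_rate, length, nu):
    """Eq. (4): probability of an offspring that is mutated but still equivalent."""
    return neutral_fidelity(copy_rate, length, nu) - fidelity(copy_rate, length)


# Illustrative values only (not the rates used in the trials).
R_c, length = 0.005, 80
for nu in (0.0, 0.1, 0.2, 0.4):
    print(nu, fidelity(R_c, length), diffusion_rate(R_c, length, nu))
```

Two limits serve as a sanity check: with zero neutrality every mutation is harmful and the diffusion rate vanishes, while with neutrality equal to one every mutated offspring is viable and the diffusion rate equals one minus the fidelity.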

2.3 Differentiation Measures 



The following measures and indicators keep track of code-differentiation. 
In biochemistry, the differentiation of expression can be very varied, and 
includes overlapping reading frames (in-phase and out-of-phase), overlapping 
operons and promoter sequences, and gene regulation. Obviously, there are 
no reading frames in our digital chemistry, but it is possible for a sequence 
of instructions to give rise to a different computation depending on which 
thread is executing it, in particular if one gene contains another (as is very 
common in overlapping biochemical genes [?]). Also, thread-identification 
may lead one thread to execute instructions which are skipped by the other 
thread, and threads may interact to turn each other on and off — a case of 
digital gene regulation. All such differentiation however has to evolve from 
the trivial secondary expression discussed earlier, and we consequently need 
to monitor the divergence of thread-execution with suitable measures. 

Expression Distance is a metric we use to determine the divergence 
of the two instruction pointers. Simply put, this measurement is the average 
distance (in units of instructions) between the sections of the genome actively 
being expressed by the individual threads. At the initial point leading to 
secondary expression, this distance is zero as the two threads execute the 
same code in lock-step. If this value is high relative to the length of the 
genome, it is a strong indication that the instruction pointers are expressing 
different sections of the genetic code at any one time, while if it is low, they 
most likely move together with identical (or nearly so) execution patterns. 
However, this measure only indicates the differentiation between execution at 
a particular point in time, implying that if the execution is simply time-offset, 
this metric may be misleading. 

Expression Differentiation distinguishes execution patterns with char- 
acteristically differing behavior. Each execution thread is recorded with time, 
and a count is kept of how often each portion of the genome is expressed. 
The expression differentiation is the fraction of the genome in which those 
counts differ. Thus, the ordering of execution (time-delay) is irrelevant for 
this metric; only whether the code ends up getting expressed differently by 
one thread vs. the other is important. 
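
Both differentiation measures can be sketched from recorded execution traces, i.e., the list of genome positions each thread executed at successive time steps. The function names are hypothetical, and the use of a circular distance is an assumption made here (avidian genomes are circular, but the text does not state how the distance is wrapped).

```python
from collections import Counter


def expression_distance(trace_a, trace_b, genome_length):
    """Mean distance (in instructions) between the two pointers at equal times."""
    pairs = list(zip(trace_a, trace_b))
    if not pairs:
        return 0.0

    def circular(a, b):
        d = abs(a - b) % genome_length
        return min(d, genome_length - d)      # assumed wrap-around distance

    return sum(circular(a, b) for a, b in pairs) / len(pairs)


def expression_differentiation(trace_a, trace_b, genome_length):
    """Fraction of genome positions executed a different number of times."""
    count_a, count_b = Counter(trace_a), Counter(trace_b)
    differing = sum(1 for pos in range(genome_length)
                    if count_a[pos] != count_b[pos])
    return differing / genome_length


# Two threads running the same four-instruction loop, one step out of phase.
thread_a = [0, 1, 2, 3]
thread_b = [1, 2, 3, 0]
print(expression_distance(thread_a, thread_b, 4))
print(expression_differentiation(thread_a, thread_b, 4))
```

The time-offset example shows why both measures are tracked: the pointers are never at the same position, so the expression distance is non-zero, yet each position is executed equally often by both threads and the expression differentiation is zero.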



2.4 Information Theoretic Measures 



We use information theory in order to distinguish sequences which do or 
do not code for genes. In our digital chemistry, regions which do not code for a 
gene are either unexecuted, i.e., the instruction pointer skips over them, or else 
neutral implying that their execution will typically not affect the behavior of 
the program. Trivial neutral instructions often involve the nop instructions 
(see Table 1) which perform no function on their own when executed, but 
do act to modify other instructions. Thus, even though their execution is 
neutral their particular value can still severely affect the functioning of the 
organism. A perfectly neutral position sports any of the D instructions with 
equal probability among a population of sequences, while a maximally fixed 
position can only have one of the D instructions there. To distinguish these, 
we define the 

Per-Site Entropy of a locus by trying out each of the D instructions at 
that position and evaluating the fitness of the resulting organisms. All neutral 
positions are assigned an equal probability to be expected at that site, while 
deleterious mutations are assigned a vanishing probability (as they would be 
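selected against). Due to the uniform assignment of probabilities, the per-site 
entropy of locus $x_i$ (normalized to the maximum entropy $\log D$) is 

$$H(x_i) = \frac{\log N_{\rm neut}(x_i)}{\log D} \qquad (5)$$

where $N_{\rm neut}(x_i)$ is the number of instructions that are neutral at locus $x_i$. 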

In an equilibrated population, this theoretical value of the per-site entropy is a 
good indicator for the actual per-site entropy, measured across the population 
(if the population is large enough). As positive mutations are extremely rare 
and we are only interested in the diversity of the population when it is in 
equilibrium, for the purposes of this measurement they are treated as if they 
were neutral. An indicator for the randomness within a sequence is the 

Per-Genome entropy, which we approximate by the sum of the per-site 
entropies 

$$H = \sum_{i=1}^{\ell} H(x_i) \qquad (6)$$

The actual per-genome entropy is in fact smaller, as the above expression 
neglects epistatic effects which lead to correlations between sites. For most 
purposes, however, the sum of the per-site entropies is a good approximation 
for the randomness. Measuring the entropy of the population by recording 
the individual genomic abundances is fruitless as the sampling error is of the 
order of the entropy [?]. 
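
The theoretical per-site and per-genome entropies of Eqs. (5) and (6) can be sketched in the same way as the neutrality. Again this is only an illustration: the function names, the tolerance, and the toy fitness function are assumptions, and the genome is assumed to contain only instructions from the instruction set, so that every locus has at least one neutral choice.

```python
import math


def per_site_entropies(genome, instruction_set, fitness, tol=1e-12):
    """Eq. (5): log(number of neutral instructions at a locus) / log(D)."""
    base = fitness(genome)
    log_d = math.log(len(instruction_set))
    entropies = []
    for i in range(len(genome)):
        n_neutral = 0
        for ins in instruction_set:
            candidate = genome[:i] + [ins] + genome[i + 1:]
            # Beneficial substitutions are counted as neutral, as in the text.
            if fitness(candidate) >= base - tol:
                n_neutral += 1
        entropies.append(math.log(n_neutral) / log_d)
    return entropies


def per_genome_entropy(genome, instruction_set, fitness):
    """Eq. (6): sum of the per-site entropies (ignores correlations between sites)."""
    return sum(per_site_entropies(genome, instruction_set, fitness))


# Hypothetical toy fitness: only the first two sites matter.
def toy_fitness(g):
    return 1.0 if g[:2] == ["A", "B"] else 0.5


print(per_site_entropies(list("ABCA"), list("ABC"), toy_fitness))
print(per_genome_entropy(list("ABCA"), list("ABC"), toy_fitness))
```

Fully fixed loci contribute zero and fully neutral loci contribute one, so in this normalization the per-genome entropy counts the effective number of unconstrained sites.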

3 Single Expression vs. Multiple Expression 

Let us first examine adaptability as measured by the average increase in 
fitness for both single and multiple expression chemistries. In Fig. 1A, the fitness 
is averaged for the 200 trials¹ which were seeded with small ($\ell = 20$) seed 
sequences and no size constraint (set I), for each of the chemistries. While the 
average increases relatively smoothly in time², it should be noted that each 
individual fitness history is marked by periods of stasis interrupted by sharp 
jumps, giving rise to a "staircase" picture reminiscent of the adaptation of 
E. coli [?]. During adaptation, the sequence length increases commensurately 
with the acquired information, as shown in Fig. 1B. 

Clearly, the trials in which multiple expression is possible adapt more 
slowly than the single-expression controls, a behavior that may appear at 
first glance to be paradoxical as the only difference in the underlying coding 
of the multiple expression trials is an increased functionality. However, as we 
have noted previously, the neutral fidelity of an organism directly determines 
the fraction of its offspring which are viable. As this value is inversely cor- 
related to the length of the genome, there is a pressure for the genomes to 
evolve towards shorter length. Normally, this pressure is counteracted by the 
adaptive forces which require the organism to store more information in its 
genome, requiring increased length. Overlapping expression patterns (here, 
multiple parallelized execution) allow this adaptation to occur while 
minimizing the length requirement. Hence, multiple-expression genomes adapt 
more slowly. 

1 Each trial is seeded with a single ancestor, which quickly multiplies to reach the 
maximum number of programs in the population, set to 3,600 for these trials. 
The population was subjected to copy mutations at a rate of 7.5 x 1Q~ A per 
instruction copied, and a rate of 0.5% of single insert or delete mutations per 
gestation period. 

2 Time is measured in arbitrary units called updates. Every update represents the 
execution of an average of 30 instructions per program in the population. 







[Figure 1: panels A-C; horizontal axis: Updates (x10^4).] 

Fig. 1. (A): Average fitness as a function of time (in updates) for 200 populations 
evolved from $\ell = 20$ ancestors, their average sequence length (B) and the average 
genomic diffusion rate (C) for the single expression chemistry controls (solid line) 
and the multiple expression chemistry (dashed line). 



The pitfalls of compacting so much information into the same portion 
of the genome are illustrated in Fig. 1C where we plot the average genomic 
diffusion rate $D_g$ for both chemistries. It is evident in this graph that 
initially both sets of experiments explore genetic space at a comparable rate, 
but at approximately 5000 updates (on average) the diffusion rates 
diverge markedly, followed by a corresponding divergence in the fitness of the 
organisms (that a higher diffusion rate leads directly to higher fitness in an 
information-rich environment is shown in [?]). Investigating the course of 
evolution further, we see that it is precisely at this point that the differentiated, 
yet overlapping, use of multiple threads is typically established. 

To further implicate overlapping expression in reduced adaptation for 
the populations, we consider (as was done in Ref. [?] for the bacteriophage 
ΦX174) the substitution rate of instructions for overlapping versus 
non-overlapping genes. The substitution rate in avida is equal to the neutrality (at 
equilibrium). We find the substitution suppression (the neutrality in multiply 
expressed code divided by the neutrality in singly expressed code) to be 
between 0.53 and 0.56 for the three sets of trials (Table 2), similar to (but not 
quite as severe as) the suppression ratio of between 0.4 and 0.5 observed 
in the bacteriophages [?]. This was to be expected, as there are no reading 
frames in avida which implies that two non-differentiated threads do not 
constrain the evolution any more than a single thread. When the instruction 
pointers do adapt independently and the threads differentiate, neutrality is 
compromised. Consequently, the instructions within sections of overlapping 
code are comparatively "frozen" into their state. 



Table 2. Average neutrality of the final dominant genotype: multiply-expressed 
code (column 1), singly expressed code (column 2), and their ratio (column 3), for 
200 populations grown from $\ell = 20$ ancestors (variable length) [set I], 100 
populations grown from $\ell = 80$ ancestors (variable length) [set II], and 100 
populations grown from $\ell = 80$ ancestors (constant length) [set III]. 

Set    nu_mult   nu_single   ratio 
I      0.109     0.202       0.539 
II     0.197     0.346       0.569 
III    0.082     0.145       0.566 
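
The substitution suppression in Table 2 is simply the ratio of the two neutralities, which can be checked with a few lines. The dictionary layout and the function name below are hypothetical; the numbers are the table entries as printed.

```python
# (nu_mult, nu_single, reported ratio) for each experimental set, from Table 2.
TABLE_2 = {
    "I": (0.109, 0.202, 0.539),
    "II": (0.197, 0.346, 0.569),
    "III": (0.082, 0.145, 0.566),
}


def substitution_suppression(nu_mult, nu_single):
    """Neutrality of multiply expressed code relative to singly expressed code."""
    return nu_mult / nu_single


for name, (nu_mult, nu_single, reported) in TABLE_2.items():
    computed = substitution_suppression(nu_mult, nu_single)
    print(name, computed, reported, abs(computed - reported) < 1e-3)
```

The recomputed ratios agree with the reported column to within the rounding of the three-digit entries, which suggests the published ratios were formed from unrounded neutralities.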



4 Evolution of Differentiation 

Let us now track the evolution of differentiation in more detail. We first ad- 
dress the de novo evolution of multiple expression, i.e., the development of 
multi-threading from linear execution. This question has previously been 
addressed within tierra [12], a population of self-replicating computer programs 
that served as the inspiration for our avida. In initial experiments, usage of 
multiple threads would not evolve spontaneously, but hand-written programs 
that had secondary expressions would evolve towards multiple expression [15]. 
More recently, experiments were carried out within a network version of the 
tierra architecture, which showed that a program which used different 
instruction pointers to execute different genes would not lose this ability [?]. The 
failure of multiple expression to evolve spontaneously in this system can be 
traced back to problems with tierra's digital chemistry and the lack of an 
information-rich environment [?]. 

Within avida, the ability to use more than a single thread begins to 
develop within the first 5000 updates and is very common after about 10,000 
updates, depending on the experimental boundary conditions. Fig. 2A shows 
the (averaged) percentage of a program's lifetime in which more than one 
thread is active, for the populations of set I (solid line), set II (dashed line) 
and set III (dotted line). It is apparent that multiple expression develops 
much more readily in smaller genomes, due to the fact that the logistics are 
less daunting. 

In panels B and C of Figure 2 we display two indicators of differentiation 
(defined earlier), the expression distance and the expression differentiation, 
respectively. The expression distance appears to be sensitive to the experi- 
mental starting condition, as set II and set III show a value over twice that 
of set I. We observe that this is due to the small size of the ancestor used 
in set I: as that ancestor develops threading very quickly, it loses adaptabil- 
ity earlier and lags both in average fitness and average sequence length. In 
fact, those averages are dragged down by a significant percentage of the trials 
in set I which were stuck in an evolutionary dead-end. Set II and III were 
seeded with an ancestor of length $\ell = 80$ and did not suffer from this fate. 
Fig. 2C shows the expression differentiation, i.e., the fraction of code that is 
executed differently by the two threads. This fraction is less dependent on ex- 
perimental conditions, and the genomes appear to develop towards 0.5. Note, 
however, that this measure cannot accurately reflect differentiation which is 
more subtle than threads executing particular instructions a different num- 
ber of times. For example, two threads which execute a stretch of code in 
an identical manner but that start execution at different points "upstream" 
may end up calculating very different functions, and thus have quite different 
behaviors. This difference will thus be underestimated. While the preceding 
graphs seem to indicate that differentiation stops about half-way through the 
duration we record, this is actually not so, as the more microscopic analysis 
of the following section reveals. Finally, Fig. 3 shows the evolution of the 
fraction of code that is executed by multiple threads. 

We anticipate that this fraction rises swiftly at first, but then levels off, 
as it is not advantageous to multiply express all genes (see below). However, 
we might anticipate that the fraction would start to decline at some point, 
when the organism develops the ability to localize its genes and use 
independent instruction pointers for each of them. We do not witness this trend in 
Fig. 3, presumably because there is no cost associated with the development 
of secondary expression. This should be viewed as a peculiarity of the digital 
environment rather than a universal feature, which we hope to eliminate with 
future refinements of the avida world. 

Fig. 2. Differentiation measures. (A): Average fraction of lifetime spent with 
secondary expression, as a function of time (in updates), (B): average expression 
distance, (C): average expression differentiation. Set I (solid line), set II (dashed line), 
and set III (dotted line). 



5 Evolution of Genetic Locality 



To get a better idea of how evolution is acting upon programs harboring 
multiple threads, we must look at exactly what is being expressed. We can 
loosely characterize all organisms by tracking three separate genes. They are 
"self-analysis" (slf), "replication" (rpt), and "computation" (cmp). To follow 
the progression of these genes through time, we examine a sample experiment 
seeded with an ancestor of size 80 (as before, capable only of self-replication), 
in an environment in which size-altering mutations are strictly forbidden (a 
trial from set III). This limitation was enforced in order to better study the 
functionality of the organism and the location of its genes. Similar studies 
have been done with all 400 trials used to collect the bulk of the data for this 
report, showing comparable behavior. 

Fig. 3. Average fraction of doubly expressed code for the three experimental sets. 
Solid line: set I, dashed line: set II, dotted line: set III. 

In Fig. 4A we follow the per-site entropies for each locus as a function of 
time. Positions are labeled by 1 to 80 on the vertical axis, while time proceeds 
horizontally. A grey-scale coding has been employed to denote the variabil- 
ity of each locus, where the white end denotes more variable positions and 
the dark end more fixed positions. Because the per-site entropies have been 
calculated by obtaining the frequency with which each instruction appears 
at that locus within the population (as opposed to the theoretical estimate 
based on neutrality), major evolutionary transitions are identifiable by dark 
vertical bands. Fig. 4B shows which portion of the code is expressed by which 
pointer, by two pointers simultaneously, or not at all. 

The first gene slf uses pattern matching on nop instructions in order to find 
the limits of its genome and from that calculate its length. This value is used 
for elongation (via the command alloc), which adds empty memory to the 
genome and prepares it for the "execution" of the replication gene. Note that 
avidian genomes are circular. There are two interesting points to note about 
the evolution of slf. First, there are many methods by which the organism can 
determine its own genomic length, so this gene tends to vary widely. Most 
of the time the organism keeps pattern matching techniques, but matches 
different portions of the code. However, often an organism shifts to purely 
numerical methods performing mathematical operations upon itself which 
yield the genome length "by accident" . The other evolutionary characteristic 
of this gene is that there is no benefit in expressing it multiple times as it 
Fig. 4. (A): Per-site entropy for each locus as a function of time for a standard (set 
III) trial. Random (variable) positions with near-unit per-site entropy are bright, 
while "fixed" instructions with per-site entropy near zero are dark. (B): Thread 
identification within a genome. Black indicates instructions which are never directly 
executed, dark grey denotes instructions executed by a single thread when no other 
thread is active, while sections which are executed by a single thread while another 
thread is executing a different section are colored in lighter shades of grey. Sections 
with overlapping expression are in white. 



has a fixed result which needs only be applied once during the gestation 
cycle. Looking at Figure 4, the slf gene initially spans from lines 44 to 61 
plus the first four lines and last four lines of the genome which are boundary 
markers fashioned from nop instructions. The first major modification to the 
slf gene occurs around update 3000. The pattern used to mark the limits of 
the genome is a series of four nop-A instructions. As a newly allocated genome 
has all of its sites initialized to nop-A, the genome is re-organized such that 
these lines are no longer copied. This reduces the possibility of variation in 
these sections of code to zero. This is apparent in Fig. 4A as the positions of 
these limit patterns become completely black, indicating vanishing entropy. 

The slf gene continuously undergoes minor changes as it becomes more 
optimized to require fewer lines of code to perform its function. Near update 
13,000 it shifts dramatically and is replaced by one in which size is calculated 
using only the final boundary markers. The distance from the gene to the final 
marker is determined, and then manipulated numerically in order to obtain the 
number which is the size of the organism. Looking at the first four lines 
of Fig. 4A around this update, we see that they are slowly phased out and 
increase in entropy as they are no longer as critical to the organism's survival. 
Finally the size of the pattern marking the end boundary of the organism is 
shortened until it becomes only a single line. By the end of the evolution 
shown, the slf gene only occupies lines 48 through 56. Note that all of these 
lines are only expressed a single time. 

The next gene under consideration is the actual replication gene rpl. This 
sequence of instructions uses the blank memory allocated in the self-analysis 
phase and enters a 'copy-loop' which moves line by line through the genome, 
copying each instruction into the newly available space. When this process 
is finished, it severs the newly created copy of itself which is then placed in 
an adjacent lattice site. These dynamics spawn off a new organism which, 
if the copy process was free of mutations, would be identical to the parent. 
In Fig. 4, the organism being tracked has its replication gene on lines 65 to 
71 until update 24,000 at which time this gene actually grows an additional 
line becoming much more efficient by "unrolling" its copy-loop. What this 
means is that it is now able to copy two lines each time through the loop. 
From the dark color of these lines, it is evident that they have very low 
entropy: mutations in them are rarely tolerated. The copy-loop is a very 
fragile portion of code, critical to the self-replication of the organism, yet 
we do see some evolution occurring here when multiple threads are in use. 
Often the secondary thread will simply "fall through" the copy-loop (not 
actually looping through to copy the genome) and move on to the next gene, 
while the other thread performs the replication. However, sometimes the two 
threads will actually evolve together to use the copy loop in different ways, 
with each thread copying part of the genome. In Fig. 4B, most of the rpl 
gene is executed by only one thread. The rpl gene is followed by junk code 
which, while executed sporadically, does not affect the fitness in any way (as 
evidenced by the light shading in Fig. 4A for these lines). 
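The effect of "unrolling" the copy-loop can be illustrated with a short sketch. It does not model avida instructions or the evolved rpl gene; the function names and the use of Python lists to stand in for parent and offspring memory are assumptions made for this example. The point is only that copying two lines per pass halves the number of loop iterations, and with it the per-iteration overhead of testing and jumping.

```python
def copy_loop(genome):
    """Copy one instruction per pass through the loop."""
    child = [None] * len(genome)  # blank memory allocated beforehand
    iterations = 0
    i = 0
    while i < len(genome):
        child[i] = genome[i]
        i += 1
        iterations += 1
    return child, iterations


def copy_loop_unrolled(genome):
    """Copy two instructions per pass through the loop."""
    child = [None] * len(genome)
    iterations = 0
    i = 0
    while i < len(genome):
        child[i] = genome[i]
        if i + 1 < len(genome):  # guard for genomes of odd length
            child[i + 1] = genome[i + 1]
        i += 2
        iterations += 1
    return child, iterations


genome = list(range(80))  # stand-in for an 80-line genome
_, plain = copy_loop(genome)
_, unrolled = copy_loop_unrolled(genome)
print(plain, unrolled)  # 80 passes versus 40 passes
```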

The most interesting of the genes is the computation gene cmp. The an- 
cestor does not possess this gene at all, so it evolves spontaneously during the 
adaptive process. There are 78 different computations rewarded in this environment, 
all of which are based on bit-wise logical operations. The organisms 
have three main commands which they use to accomplish those: a get in- 
struction which retrieves numbers from the environment, a put instruction to 
return the processed result, and a nand instruction which computes the logical 
operation not-and (see Table I). Any logical operation can be computed 
with a properly arranged collection of nand instructions. 
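That nand alone suffices can be checked directly. The sketch below builds NOT, AND, OR and XOR from a bit-wise nand; the 32-bit word size (`MASK`) and the helper names are assumptions made for this example and are not taken from the avida instruction set.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit word size


def nand(a, b):
    """Bit-wise not-and, truncated to the assumed word size."""
    return ~(a & b) & MASK


def not_(a):
    return nand(a, a)


def and_(a, b):
    return not_(nand(a, b))


def or_(a, b):
    return nand(not_(a), not_(b))


def xor_(a, b):
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))


a, b = 0b1100, 0b1010
print(and_(a, b) == a & b, or_(a, b) == a | b, xor_(a, b) == a ^ b)
```

Each derived operation uses between one and four nand calls, so more elaborate computations require correspondingly longer arrangements of the instruction.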

The cmp gene(s) evolve uniquely in each trial, enabling the organisms 
to perform differing sets of tasks. There are, however, certain themes which 
we see used repeatedly whereby the same section of code is used by both 
threads, but their initial values (i.e., the processing performed thus far on 
the inputs) differ. Consequently, this section of code performs radically different 
tasks, actually encouraging this overlapping. Portions of this algorithm 
which might have some neutrality for a single thread of execution will now 
be frozen due to the added constraints imposed by a secondary execution. 
The size of cmp grows during adaptation as a number of computations are 
performed, and the gene is almost always expressed by both threads as this 
is always advantageous. In Fig. 4, the cmp gene stretches from line 1 to line 
42 (at update 30,000), while it is considerably smaller earlier. Furthermore, 
the genome manages to execute the entire gene by both threads (the tran- 
sition from single expression of part of cmp to double expression is visible 
around update 20,000). This gene ends up being expressed many times (as 
the instruction pointers return to this section many times during execution) . 
All in all, 17 different logical operations are being performed by this gene. 
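How one stretch of code can perform different tasks for two threads may be sketched as follows. The two-instruction `shared_section`, the register handling, and the 32-bit word size are hypothetical and chosen only to illustrate the idea; they are not taken from an evolved genome.

```python
MASK = 0xFFFFFFFF  # assumed 32-bit word size


def nand(a, b):
    """Bit-wise not-and, truncated to the assumed word size."""
    return ~(a & b) & MASK


def shared_section(reg, y):
    """Two instructions that both threads execute unchanged."""
    reg = nand(reg, y)    # first shared instruction
    reg = nand(reg, reg)  # second shared instruction negates the result
    return reg


x, y = 0b1100, 0b1010
thread_a = shared_section(x, y)           # enters with the raw input x
thread_b = shared_section(nand(x, x), y)  # enters with x already negated

print(thread_a == (x & y))   # thread A has computed x AND y
print(thread_b == (~x & y))  # thread B has computed (NOT x) AND y
```

Because both results depend on the same two instructions, a mutation in either one changes the output of both threads at once, which is the added constraint that freezes such sections.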

By the end of the evolution tracked in Fig. 4, most of the genes appear 
to occupy localized positions on the genome. The cmp gene (white sections 
in Fig. 4B) is revisited many times by both threads with differing initial 
conditions for the registers, allowing the genome to maximize the computational 
output. In the meantime, those sections have become fixed (their variability 
is strongly reduced) as witnessed by their dark shading in Fig. 4A. 

6 Discussion and Conclusions 

The path taken by evolution from simple organisms with few genes towards 
the expression of multiple genes via overlapping and interacting gene prod- 
ucts in complex organisms is difficult to retrace in biochemistry. Artificial 
Life, the creation of living systems based on a different chemistry but us- 
ing the same universal principles at work in biochemical life, may help to 
understand some key principles in the development of gene regulation and 
the organization of the genetic code. We have examined the emergence and 
differentiation of code expression in parallel within a digital chemistry, and 
found some of the same constraints affecting multiply expressed code as those 
observed in the overlapping genes of simple biochemical organisms. For ex- 
ample, multiply expressed code is more fragile with respect to mutations 
than code that is "transcribed" by only one instruction pointer, and as a 
result evolves more slowly. During most stages of evolution, two constraints 
are most notable: the pressure to reduce sequence length in order to lessen 
the mutational load, and the pressure to increase sequence length in order 
to be able to store more information. Simple organisms can give in to both 
pressures by using overlapping genes, gaining in the short term but mortgaging 
the future: the reduced evolvability condemns such organisms to a slower 
pace of adaptation, and exposes them to the risk of extinction in periods of 
changing environmental conditions. 

This trend is clearly visible in the evolution of digital organisms, as is a 
trend towards multiple expression of as much of the code as possible. This 
latter feature we believe not to be universal, but rather due to the fact that 
multiple expression in avida is cheap, i.e., no resources are being used in order 
to express more code. In a more realistic chemistry, this would not be the 
case: adding an instruction pointer should put some strain on the organism 
and use up energy; in such circumstances multiple expression would only 
emerge if the advantage of the secondary expression outweighs the cost of 
it. We also expect more complex gene regulation in such an environment, as 
genes would be turned on only when needed. 

Still, under extreme conditions we believe that multiple overlapping genes 
are a standard path that any chemistry might follow. Even though evolution 
slows down, such organisms can be rescued either by the development of 
error-correction algorithms, or an external change in the error rate. In either 
case, a drastic reduction of the mutational load would enable the sequence 
length to grow and the overlapping genes to be "laid out" (for example by 
gene-duplication). The corresponding easing of the coding constraints might 
give rise to an explosion of diversity and possibly the emergence of multi- 
cellularity. 